# Multimodal retrieval

**Jina Embeddings V4** (jinaai; Multimodal Fusion, Transformers, Other; 669 downloads, 36 likes): A general-purpose embedding model designed for multimodal and multilingual retrieval, particularly suited to complex, visually rich documents containing charts, tables, and illustrations.
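
As a sketch of how such a model is typically driven, the snippet below loads jina-embeddings-v4 with Hugging Face transformers and ranks page images against a text query. The `encode_text`/`encode_image` method names and the `task="retrieval"` argument are assumptions based on the conventions of earlier Jina model cards, not a confirmed API; consult the jinaai/jina-embeddings-v4 card before relying on them.

```python
# A minimal sketch, assuming the jinaai/jina-embeddings-v4 remote code exposes
# encode_text/encode_image with a task argument (the pattern of earlier Jina
# releases); the real method names may differ, so check the model card.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,  # the model ships its own encoding code
)

# Embed a text query and two page images into the same vector space.
query_emb = model.encode_text(texts=["quarterly revenue chart"], task="retrieval")
page_embs = model.encode_image(
    images=["report_page_1.png", "report_page_2.png"],  # placeholder files
    task="retrieval",
)

# Rank pages by cosine similarity to the query (higher = better match).
scores = torch.nn.functional.cosine_similarity(
    torch.as_tensor(query_emb), torch.as_tensor(page_embs)
)
print(scores)
```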
**CLIP ViT H 14 Laion2b S32b B79k** (ModelsLab; MIT; Text-to-Image; 132 downloads, 0 likes): A vision-language model built on the OpenCLIP framework and trained on the LAION-2B English subset, excelling at zero-shot image classification and cross-modal retrieval.

**CLIP ViT B 32 Laion2b S34b B79k** (recallapp; MIT; Text-to-Image; 17 downloads, 0 likes): A vision-language model trained on the LAION-2B English dataset with the OpenCLIP framework, supporting zero-shot image classification and cross-modal retrieval.
**Colpali V1.1** (vidore; MIT; Text-to-Image, Safetensors, English; 196 downloads, 2 likes): ColPali is a visual retrieval model that combines PaliGemma-3B with the ColBERT late-interaction strategy to index documents efficiently from their visual features.
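
The ColBERT strategy mentioned here is a late-interaction scheme: instead of one vector per document, ColPali keeps one embedding per image patch and one per query token, and scores a page by summing each query token's best match. The sketch below reimplements that scoring in plain PyTorch with random stand-in tensors; a real pipeline would obtain the embeddings from the vidore/colpali-v1.1 checkpoint (for example via the colpali-engine library).

```python
# A minimal sketch of ColBERT-style "late interaction" scoring over visual
# features: match every query-token embedding against every document-patch
# embedding, then sum the per-token maxima (MaxSim).
import torch

def late_interaction_score(query: torch.Tensor, doc: torch.Tensor) -> torch.Tensor:
    """query: (n_query_tokens, dim); doc: (n_patches, dim); both L2-normalized."""
    sim = query @ doc.T                  # (n_query_tokens, n_patches) cosine similarities
    return sim.max(dim=1).values.sum()   # best patch per query token, summed

torch.manual_seed(0)
query = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)    # 12 query tokens
page_a = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1) # 1024 image patches
page_b = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)

# Rank the two pages against the query; the higher score wins.
print(late_interaction_score(query, page_a), late_interaction_score(query, page_b))
```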
**Patentclip RN101** (hhshomee; MIT; Image Classification; 15 downloads, 0 likes): A zero-shot image classification model built on the OpenCLIP library, suited to patent image analysis.
**CLIP ViT B 32 Laion2b S34b B79k** (rroset; MIT; Text-to-Image; 48 downloads, 0 likes): A CLIP ViT-B/32 model trained on the LAION-2B dataset with the OpenCLIP framework, supporting zero-shot image classification and cross-modal retrieval.

**CLIP ViT B 32 DataComp.XL S13b B90k** (laion; MIT; Text-to-Image; 12.12k downloads, 4 likes): A CLIP ViT-B/32 model trained on the DataComp-1B dataset, designed for zero-shot image classification and image-text retrieval.

**CLIP ViT B 32 256x256 DataComp S34b B86k** (laion; MIT; Text-to-Image; 4,332 downloads, 8 likes): A CLIP ViT-B/32 model trained at 256x256 resolution on the DataComp-1B dataset with the OpenCLIP framework, used mainly for zero-shot image classification and image-text retrieval.
**CLIP ViT B 16 DataComp.XL S13b B90k** (flavour; MIT; Image-to-Text; 39.22k downloads, 1 like): A CLIP ViT-B/16 model trained on the DataComp-1B dataset, supporting zero-shot image classification and image-text retrieval.
**CLIP ViT L 14 DataComp.XL S13b B90k** (laion; MIT; Text-to-Image; 586.75k downloads, 113 likes): A CLIP ViT-L/14 model trained on the DataComp-1B dataset, used primarily for zero-shot image classification and image-text retrieval.
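
Since this checkpoint (like most of the LAION entries below) is published in OpenCLIP format, cross-modal retrieval reduces to embedding both modalities and comparing cosine similarities. A minimal sketch, assuming the open_clip_torch and Pillow packages and placeholder image files:

```python
# Image-text retrieval with the ViT-L/14 DataComp model, loaded from the
# Hugging Face Hub through OpenCLIP. File names are placeholders.
import torch
import open_clip
from PIL import Image

HUB_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"
model, _, preprocess = open_clip.create_model_and_transforms(HUB_ID)
tokenizer = open_clip.get_tokenizer(HUB_ID)
model.eval()

images = torch.stack([
    preprocess(Image.open(p)) for p in ["cat.jpg", "skyline.jpg"]  # placeholders
])
texts = tokenizer(["a photo of a cat", "a city skyline at night"])

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(texts)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)
    sims = txt_feats @ img_feats.T  # text-to-image cosine similarity matrix

print(sims)  # each row: how well one caption matches each image
```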
**CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Soup** (laion; MIT; Text-to-Image; 9,412 downloads, 22 likes): A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, and the first CLIP model with a non-ViT image tower to exceed 79% ImageNet top-1 zero-shot accuracy.

**CLIP Convnext Large D 320.laion2B S29b B131k Ft** (laion; MIT; Text-to-Image, TensorBoard; 3,810 downloads, 3 likes): A CLIP model with a ConvNeXt-Large image tower, trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.

**CLIP Convnext Large D 320.laion2B S29b B131k Ft Soup** (laion; MIT; Text-to-Image, TensorBoard; 83.56k downloads, 19 likes): A CLIP model with a ConvNeXt-Large image tower, trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.

**CLIP Convnext Large D.laion2b S26b B102k Augreg** (laion; MIT; Text-to-Image, TensorBoard; 80.74k downloads, 5 likes): A large-scale ConvNeXt-Large CLIP model trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.

**CLIP Convnext Base W 320 Laion Aesthetic S13b B82k** (laion; MIT; Text-to-Image, TensorBoard; 12.67k downloads, 3 likes): A CLIP model with a ConvNeXt-Base image tower, trained on a subset of LAION-5B, suited to zero-shot image classification and image-text retrieval.

**CLIP Convnext Base W Laion Aesthetic S13b B82k** (laion; MIT; Text-to-Image, TensorBoard; 703 downloads, 5 likes): A CLIP model with a ConvNeXt-Base image tower, trained on the LAION-Aesthetic dataset, supporting zero-shot image classification and cross-modal retrieval.

**CLIP Convnext Base W Laion2b S13b B82k** (laion; MIT; Text-to-Image; 4,522 downloads, 5 likes): A CLIP model with a ConvNeXt-Base image tower, trained on a subset of LAION-5B, supporting zero-shot image classification and image-text retrieval.
**CLIP ViT B 16 Laion2b S34b B88k** (laion; MIT; Text-to-Image; 251.02k downloads, 33 likes): A multimodal vision-language model trained with the OpenCLIP framework on the LAION-2B English dataset, supporting zero-shot image classification.
**CLIP ViT B 32 Laion2b S34b B79k** (laion; MIT; Text-to-Image; 1.1M downloads, 112 likes): A vision-language model trained on the English subset of LAION-2B with the OpenCLIP framework, supporting zero-shot image classification and cross-modal retrieval.
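
To close, here is the standard OpenCLIP zero-shot classification recipe applied to the most-downloaded model in this list. The image path and label set are placeholders; everything else follows the usage documented in the OpenCLIP README.

```python
# Zero-shot image classification with laion/CLIP-ViT-B-32-laion2B-s34B-b79K.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder file
labels = ["a diagram", "a dog", "a cat"]
text = tokenizer([f"a photo of {label}" for label in labels])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities, softmaxed into per-label probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```

Prompt templates such as "a photo of a ..." generally score better than bare class names, which is why the labels are wrapped before tokenization.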